A Short Note on Use of Server Scripts to Access Functional Coupling Scores

This short note will discuss the following command and what it accomplishes:


svr_all_features 83333.1 peg | svr_ids_to_figfams | svr_fc_figfams -MinSc 100 | svr_figfams_to_ids 83333.1 | svr_function_of > EC.data

It chains together five svr scripts, which I will now discuss in turn.


   svr_all_features 83333.1 peg

is used to generate a list of the features of type 'peg' that occur in the genome with ID 83333.1 (which is
Escherichia coli K12).  It produces output of the form

fig|83333.1.peg.1
fig|83333.1.peg.2
fig|83333.1.peg.3
fig|83333.1.peg.4
fig|83333.1.peg.5
fig|83333.1.peg.6
.
.
.

   svr_ids_to_figfams

takes an input file in which one column contains PEG IDs (by default the last column).
It writes two output streams.  Lines of input in which the PEG is not included in any FIGfam are written to STDERR.  All other lines are written to STDOUT, each with two fields appended: the family function of the FIGfam containing the PEG and the FIGfam ID.  Thus, the first few lines written to STDOUT would look like

fig|83333.1.peg.1       Thr operon leader peptide       FIG164298
fig|83333.1.peg.2       Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3)      FIG000885
fig|83333.1.peg.3       Homoserine kinase (EC 2.7.1.39) FIG000582
fig|83333.1.peg.4       Threonine synthase (EC 4.2.3.1) FIG000134
fig|83333.1.peg.5       hypothetical protein    FIG004675
fig|83333.1.peg.6       UPF0246 protein YaaA    FIG002158
.
.
.
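Because the two kinds of lines go to different streams, they can be captured in separate files with ordinary shell redirection. The following is a minimal sketch, assuming the svr scripts are installed and on your PATH (it does nothing otherwise). The file names matched.tbl and unmatched.txt are illustrative choices, not anything the scripts require.

```shell
#!/bin/sh
# Assumption: the svr scripts are installed and on PATH; if not, skip.
# matched.tbl and unmatched.txt are illustrative file names.
if command -v svr_all_features >/dev/null 2>&1 &&
   command -v svr_ids_to_figfams >/dev/null 2>&1; then
    # STDOUT (PEGs that belong to a FIGfam) goes to matched.tbl;
    # STDERR (PEGs that belong to no FIGfam) goes to unmatched.txt.
    svr_all_features 83333.1 peg | svr_ids_to_figfams > matched.tbl 2> unmatched.txt
    # Report how many lines landed in each file.
    wc -l matched.tbl unmatched.txt
else
    echo "svr scripts not found on PATH; skipping" >&2
fi
```

Within a longer pipeline, only STDOUT is passed along to the next script, so the lines written to STDERR simply appear on the terminal unless they are redirected like this.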

The command
svr_fc_figfams -MinSc 100

takes an input stream in which one column (by default the last) contains a FIGfam ID. If the FIGfam does not show a tendency to co-occur with another FIGfam (that is, to occur within 5 kb of it on the chromosome), it produces no output.  However, if the FIGfam does tend to co-occur, a line will be written for every FIGfam that it tends to co-occur with. Two fields are added to the end of each output line: the number of OTUs in which the FIGfams co-occur and the FIGfam ID of the co-occurring FIGfam.  An OTU (for our purposes) is just a set of genomes that are very close phylogenetic neighbors.  If you wish to get a list of the genomes that represent distinct OTUs, just run

 svr_otus

If you wish to also see the set of genomes included in each OTU, use

         svr_otus | svr_members_of_otu

Anyway, the command 

svr_fc_figfams -MinSc 100

says "show me only FIGfams that occur close to the input FIGfam in at least 100 distinct OTUs". As of September 2010, the SEED contains approximately 1000 distinct OTUs.


The command

svr_figfams_to_ids 83333.1 

takes an input stream (a tab-delimited table) that has a column containing FIGfam IDs (by default, the last column).  The command line arguments say "restrict output to members of these genomes (in this case to a single genome, 83333.1)".  For each input line, a number of output lines will be written:
one for each PEG in the input FIGfam that comes from the designated set of genomes.

Finally, I used

 svr_function_of

to tack on the functions of the PEGs listed in the last column.

I urge you to get used to thinking about this simple technique for extracting data. Used with the Unix tools sort, cut, and grep, these scripts offer rich functionality.
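As a small illustration of combining this kind of output with grep, cut, and sort, the sketch below works on the six sample rows of svr_ids_to_figfams output shown earlier in this note, so it runs without the svr scripts installed. It assumes the columns are tab-separated; the file names sample.tbl, with_ec.tbl, and figfam_ids.txt are illustrative.

```shell
#!/bin/sh
# Sample rows copied from the svr_ids_to_figfams output shown earlier
# (tab-separated: PEG ID, family function, FIGfam ID).
{
  printf 'fig|83333.1.peg.1\tThr operon leader peptide\tFIG164298\n'
  printf 'fig|83333.1.peg.2\tAspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3)\tFIG000885\n'
  printf 'fig|83333.1.peg.3\tHomoserine kinase (EC 2.7.1.39)\tFIG000582\n'
  printf 'fig|83333.1.peg.4\tThreonine synthase (EC 4.2.3.1)\tFIG000134\n'
  printf 'fig|83333.1.peg.5\thypothetical protein\tFIG004675\n'
  printf 'fig|83333.1.peg.6\tUPF0246 protein YaaA\tFIG002158\n'
} > sample.tbl

# grep: keep only the rows whose function carries an EC number.
grep 'EC [0-9]' sample.tbl > with_ec.tbl

# cut + sort: extract the FIGfam ID column (field 3), sorted, duplicates removed.
cut -f3 with_ec.tbl | sort -u > figfam_ids.txt

cat figfam_ids.txt
```

The same pattern applies to the output of the full pipeline: grep selects rows of interest, cut picks out the column you need, and sort orders the result or removes duplicates.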